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ABSTRACT 

Recent advances in psychometrics and computer 
technology encourage the development of model-based methods of 
individualized testing on a microcomputer, where each exami'nee 
receives short tests an«d where the number of pretest items that can 
be administered is severely restricted. On-line (i.e., data is 
collected on operational equipment) methods for calibrating pretest 
items in this setting face new challenges. In the context of adaptive 
testing (CAT), two LOGIST-based methods of on-line calibration that 
were developed are described. These two methods were applied and 
compared in a simulation study using data from the CAT Armed Services 
Vocational Aptitude Battery. One of the methods was superior to the 
other in terms of statistical properties, but the superior method was 
more expensive to implement in terms of the numbers of items required 
and either the numbers of examinees or numbers of items per examinee 
required. Both methods showed scale drift, although the drift was in 
opposite directions for the two methods. Ways to reduce scale drift 
are explored. An appendix contains 4 tables and 22 figures, and 17 
references are attached. (Author/SLD) 
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Abstract 

Recent advances in psychometrics and computer technology 
encourage the development of model -based methods of individualized 
testing on a microcomputer, where each examinee receives short 
tests and the number of pretest items that can be administered is 
severely restricted. On-line (i.e., data is collected on 
operational equipment) methods for calibrating pretest items in 
this setting face new challenges. In the context of adaptive 
testing, two LOGIST-based methods of on-line calibration were 
developed and are described. These two methods are tried out and 
compared in a simulation study using data for the CAT-ASVAB. 

Keywords : Adaptive testing 
Item pools 
IRT scales 
Calibration 
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Introduction 

Conventional modes of testing, where every examinee takes a 
single test form (or parallel versions of it) in large 
administrations using paper and pencil as the delivery and 
recording medium, belong to the past. Recent advances in both 
psychometrics and computer technology seem to point in the 
direction of model-based methods; of individualized testing on a 
microcomputer, where each examinee receives short tests and 
testing is generally done on a demand basis. Much has been 
written about adaptive testing in this latter context. But other 
new methods of testing, for example model-based mastery testing or 
work-sample testing, may also be designed as short tests 
administered by computer to small samples of examinees. 

Traditional methods of estimating parameters for items, 
particularly modei-based parameters as opposed to conventional 
psychometric parameters such as proportions -correct , depend upon 
relatively large numbers of examinees taking relatively large 
numbers of items. This is especially true for the more frequently 
used complex models, such as the 3-parameter logistic model. 
Model-based testing requires, moreover, item parameter estimates 
in order to design tests. The challenge is to develop new methods 
of item parameter estimation (calibration) that capitalize on 
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computerized testing modes without destroying the advantages of 
short computerized tests. These methods are generally referred to 
as 'on-line' calibration methods because the data collected for 
calibration of new items comes from the administration of these 
items on operational equipment as part of an examinee's testing 
session. , 

In some contexts, the acquisition of sufficient numbers of 
examinees for adequate parameter estimation in on-line calibration 
may be problematical. In these situations, it may be possible to 
accumulate a reasonable number of examinees over a realistic time 
frame. If this is not feasible, then on-line calibration methods 
that are more optimal for this situation than those described here 
need to be developed. 

Even if the number of examinees is sufficient, a central 
problem, in any application of on-line calibration, is how to 
obtain adequate calibrations of pretest items when both the length 
of the computerized test is very short and the number of extra 
items that can be administered in pretest mode to an examinee is 
severely restricted. The current context of the particular 
methods of on-line calibration reported here is adaptive testing 
for the Armed Services Vocational Aptitude Battery (ASVAB) . The 
results, however, are more general and apply to other modes of 
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computerized testing that rely on recent advances in psychometric 
theory . 

Remaining sections of this paper will describe the current 
project and its design, two methods oi' on-line calibration based 
on the estimation procedures in the LOGIST computer program 
(Wingersky, 1983) and the efficacy of these methods in the context 
of adaptive testing, and how these and othtr similar on-line 
calibration methods might be improved in the context of adaptive 
testing as well as other types of computerized testing. 

The Current Project 

A number of agencies of the Department of Defense funded a 
three -year project to develop and evaluate different methods of 
on-line calibration for the computerized adaptive Armed Services 
Vocational Aptitude Battery (CAT-ASVAB) . Although exploration of 
general classes of item calibration methods was encouraged in this 
context, final results were constrained to be parameterized as the 
3 -parameter logistic model of item response functions. The final 
phase of this project consisted of a series of 'high-fidelity' 
simulations of adaptive testing and item pool refreshment. The 
simulations are called 'high-fidelity' for two reasons. First, 
the true item response functions and ability densities, both 
developed by Levine (1987) are nonparametric , and therefore more 
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realistic than more restrictive parametric models. Second, the 
study design represents a realistic scenario for on-line 
calibrations for the adaptive ASVAB as well as many other types of 
computerized tests. Two LOGIST-based methods of on-line 
calibration were investigated. As part of the same project, 
Darrell Bock (1987), Michael Levine (1987), and Fumiko Samejima 
(1987) investigated other methods of on-line calibration. 

The results presented here provide some general guidance for 
effective computerized test item pool construction and 
maintenance. In addition, one LOGIST-based method is superior to 
the other in terms of statistical properties. However, the 
superior method is more expensive to implement in terms of the 
numbers of items required, and either the numbers of examinees 
required or the numbers of items per examinee. 

The Study Design 
Figure 1 displays the overall design of the study. Although 
four experimenters participated, only the details of the LOGIST- 
based methods indicated on the right side of Figure 1 will be 
discussed in detail below. Using data provided by Vale 
(Prestwood, Vale, Massey and Welsh, 1985), nonparametric item 
response functions and ability densities were produced by Levine 
(1987) to serve as true item response functions and ability 
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densities. Also using Vale's data, LOGIST was run to provide 
parameter estimates for subsequent use. Items were selected for 
an initial pool and calibrated by each experimenter. Then four 
rounds adaptive test simulations and item pool refreshment were 
conducted. 

The Original Data (Block 1 of Figure 1) 

Vale, et al., developed over 2000 experimental items to be 
considered as candidates for the original ASVAB adaptive testing 
item pool. A subset of these items, developed for the Word 
Knowledge (WK) subtest, was used in this study. The WK subtest 
consists of a single item type, synonyms, and is designed to 
measure the understanding of words typically used in social 
studies and everyday life, human relationships, science and 
nature, and arts and humanities (Prestwood, et al . , 1985). The 
focus of the Vale item development effort was to write and 
calibrate similar items that spanned a wider range of difficulty 
than those found in operational use. 

Vale developed a total of 258 such WK items. For the purpose 
of obtaining item parameter estimates he obtained a sample of 
N - 8171 candidates for military service from Military Entrance 
Processing Stations who had also taken the conventional ASVAB test 
battery. This set of data, that is, the responses of 8171 
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individuals to the 'experimental' Word Knowledge items as well as 
the conventional ASVAB, forms the basis of the current study. 

A calibration of the experimental WK items using the computer 
program LOGIST was performed. This calibration is noted in the 
top block in Figure 1. Table 1 shows summary statistics for the 
item parameter estimates obtained from LOGIST. The items were 
designed to be different from a conventional test aimed at the 
average examinee; the purpose was to obtain items for an adaptive 
test item pool. The items are supposed to be very discriminating, 
with low guessing parameters, and span a wide range of difficulty. 
As seen in Table 1, the items are indeed more discriminating than 
is customarily seen. Although the items span a wide range of 
difficulty, there are more easier items relative to the number of 
harder items. 



Insert Figure 1 and Table 1 about here 



Summary statistics for the ability estimates of the 
calibration sample are also shown in Table 1. Although this 
sample of people was the only one available for the purpose of 
calibrating these experimental items, the sample may not be 
completely appropriate. Examinees were informed that their scores 
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on the experimental iters did not count. It is therefore possible 
that the motivation of individuals was not the same as motivation 
presumably will be in the intended population. 
The Definition of Truth (Block 1 of Figure 1) 

Using the data provided by Vale, Levine produced 
nonparametrir. item response functions and ability densities to be 
used as true item response functions and ability densities in this 
study. Figure 2 shows the true item response functions for some 
typical items. In general, these true item response functions are 
nonmonotonic . 



Insert Figure 2 about here 



In Vale's data collection design, no examinee was 
administered all 258 experimental items. Rather, the items were 
broken up into three blocks of 86 items that were roughly parallel 
to each other. Each block was administered to a subset of 
examinees. Therefore Levine produced three nonparametric ability 
densities. To produce a description of a single ability density, 
Davis (1987) averaged the three densities, and then interpolated, 
integrated, and normalized the resulting function. All simulated 
examinees (simulees) are drawn from the resulting distribution. 
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Initial Item Selection (Block 2 of Figure 1) 

An initial set of 100 items was drawn by Davis from the 258 
experimental items to form the first adaptive test item pool. 
Research has shown (Hulin, Drasgow and Parsons, 1983) that item 
pools much larger than this are not necessary for short adaptive 
tests. To select the 100 items, Davis used the LOGIST parameter 
estimates summarized in Table 1 to compute a table of items sorted 
by their estimated information functions at various levels of 
ability. Items yielding successively decreasing amounts of 
information were selected until 100 unique items were obtained. 

There were two constraints on the process used by Davis. 
First, 10 items that were judged to be excessively non-3PL in 
shape on the basis of observed data were eliminated from the 258 
before selecti on was done. These non-3PL items exhibited either 
severe nonmonotonicities or broad plateaus in the mid- range of 
ability. Second, the ability metric was divided into intervals of 
1.0 between -3.0 and +3.0, and the numbers of items selected from 
each of the two most extreme intervals was constrained to be no 
more that 10, while 20 items were selected from each of the 
remaining intervals. This latter constraint was to control for 
the fact that there are proportionally too many easy items in 
Vale's original item set. 
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The Simulated Initial Calibration (Block 2 of Figure 1) 

Up to this point in the study design, Vale's WK data have 
been used to develop a definition of truth for the simulations, 
and to provide estimated parameters for the purpose of selecting 
the initial 100 item pool. But these parameter estimates are no 
longer appropriate for subsequent steps. They represent 3PL 
estimates of the true item response functions that are 
hypothesized to underlie the responses collected by Vale from live 
examinees. These item response functions are not the same as 
those generated by Levine to be used as the (non-3PL) definition 
of truth for this study. Instead, it is necessary to obtain 3PL 
parameter estimates from data where Levine 's true item response 
functions generate responses to items from simulees. This step is 
a simulated initial calibration of the 100 item pool. 

Davis divided the 100 items into four 25- item subsets, each 
of which had approximately the same estimated test information 
function based on LOGIST parameter estimates. Each of N - 6000 
simulees was administered two subsets of 25 items in an 
overlapping design that provided 3000 simulee responses per item. 
Responses to the items were generated using the true (non-3PL) 
item response functions and simulee abilities. LOGIST was run on 
these data to provide estimated item parameters for the 100 items 
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in the Round 0 adaptive test item pool. These parameters 
estimates were then returned to Davis for the next step in the 
simulation study. 

A Typical Round in the Simulation Study (Blocks 3 through 6, 
Figure 1) 

A typical Round in the simulation study consisted of the 
simulation of an adaptive test (by Davis), the selection and 
seeding of candidate new items for the adaptive test item pool (by 
Davis), the identification of items to be removed (by Davis), and 
the calibration and selection of new items to be included in the 
next Round of simulations (by individual experimenters). Davis 
performed his functions separately by experimenter, and, for the 
results reported here, separately by the two LOGIST-based methods 
of on-line calibration. Thus the items in the pool, the item 
parameter estimates, and adaptive test simulations may differ for 
each experimental method at each Round. This process was repeated 
for a total of four Rounds. 

For each Round, Davis simulated the administration of a 15- 
item fixed- length adaptive test to a sample of N - 15,000 simulees 
drawn at random from the composite ability density. The same 
sample of simulees was used for each experimental method within a 
Round. Owen's (1975) Bayesian procedure was used to update 
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ability estimates during the adaptive test. The next item to be 
administered during the test was chosen to be that item that was 
most informative at the estimated ability, except for the 
imposition of efforts to control item exposure. 'Exposure' 
parameters controlled the probabilities with which an optimally- 
selected item was actually administered to a simulee. For the 
first item, this probability was .2, for the second item the 
probability was .25, for the third it was .33, for the fourth it 
was .5, and for the fifth through the fifteenth, this probability 
was 1.0. 

Item usage data was collected by Davis for each item, and 
accumulated across each Round in the simulations. At each Round, 
the 25 most used items of the current 100 item pool were 
identified as items that must be replaced for the next Round. In 
addition, for each Round, Davis identified a pool of candidate 
'new' items by randomly selecting 50 items from the full set of 
Vale's original 258 items. Within a Round, the same 50-item set 
of candidate new items was used for each experimental Method. 
Response data were collected for these items by administering each 
simulee 5 randomly selected items from this set of 50. On 
average, each new item had about 1500 responses to be used in the 
subsequent estimation of item parameters by each experimenter. 
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Responses to these new items played no role in item selection or 
ability estimation during the course of the adaptive test. 
The Final 'Half Round (Block 7 of Figure 1) 

At the end of four Rounds, the original 100- item pool has 
been 'refreshed' four times. In order to examine the cumulative 
effects of these four Rounds on simulee ability, Davis conducted a 
final half-Round, consisting of just the simulation of an adaptive 
test using this final item pool. 

Two LOGIST-based Methods of On-line Calibration 
A Note on Methods That Do NOT Work 

Lord (1984) described a straightforward approach to on-line 
calibration that did not work. Since the logic underlying this 
approach is so appealing, it is instructive to examine it. The 
basic idea of this approach was to simultaneously calibrate the 
candidate new items and recalibrate the entire adaptive test item 
pool. Since the results of such a calibration would not be on the 
same scale as the original item pool, it would be necessary to 
determine a transformation of these results to that scale. Using 
the relationship between the item parameter reestimates for the 
adaptive test pool and the original item parameter estimates for 
these same items, a suitable scaling transformation could be 
determined by minimizing the difference between the test 
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characteristic curves for the two different sets of estimates, as 
in Stocking and Lord (1983) . After such a transformation had been 
applied, the calibration results for the candidate new items would 
also be on the scale of the adaptive test item pool. This basic 
plan of simultaneous calibration of candidate new items and the 
current adaptive test item pool, along with a transformation, 
would then be repeated for each Round of the simulation study. 

A problem with this approach is that some of the items in the 
adaptive test item pool may not have been administered frequently 
enough to provide adequate data for subsequent recalibration. To 
solve this problem, those items that were infrequently used in the 
adaptive test were administered nonadaptively to a sufficient 
number of examinees for adequate calibration. Thus, in Lord's 
design, there were three types of items to be calibrated, 
distinguishable by the nature of the data available for 
calibration purposes. The first type consisted of candidate new 
items for which only nonadaptive responses were collected. The 
second type consisted of items already in the pool, but whose 
response rate was sufficiently low in the adaptive test that some 
nonadaptive administrations had been performed to increase the 
number of responses per item. The third type consisted of items 
already in the pool, but whose response rate was sufficiently high 
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that only adaptive test responses were required for adequate 
calibration. 

The third type of item, that is, those items for which only 
adaptive responses were obtained, causes severe problems when 
attempting to estimate parameters for the 3PL. Since their 
response rate is high in the adaptive test, these are the more 
discriminating items in the pool. The more discriminating an 
item, the greater the change in the probability of a correct 
answer for small changes in ability close to the item difficulty. 
Since the adaptive test works well, these items are usually 
administered to simulees with very similar levels of ability. 
Thus the distribution of ability for those simulees administered 
this kind of item becomes more concentrated the better the 
adaptive test works. These highly discriminating items divide 
this concentrated ability distribution into two classes: those 
simulees whose ability lies slightly below the item difficulty and 
therefore have only a small probability of responding correctly, 
and those y.imulees whose ability lies slightly above the item 
difficulty and therefore have a high probability of responding 
correctly. The observed data from which item parameters are to be 
estimated are the actual item responses from the simulees. These 
data also tend to be divided into two classes by the highly 
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discriminating item -- those who respond correctly and those who 
respond incorrectly. These responses are not sufficient to 
estimate item discrimination or the lower asymptote. This 
situation is illustrated for a few items in Figure 3 (taken from 
Lord, 1984) , where the boxes represent observed proportions 
correct for particular ability groups and are plotted proportional 
in size to the number of cases in the group. In the extreme case, 
for an infinitely long adaptive test in which the estimated 
ability becomes indistinguishable from the true ability, the 
response data from which three item parameters must be estimated 
collapses into a single point located at the item difficulty, with 
the observed proportion of correct responses halfway between 
chance and 1.0. 



Insert Figure 3 about here 



Calibration procedures such as LOGIST, that do not utilize 
other information that might be available about the item 
parameters, cannot perform adequately on items such as these for 
which only adaptive responses are available. Mislevy (personal 
communication, 1987) conjectures that any other currently 
available estimation procedure would encounter similar 
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difficulties in trying to estimate 2PL or 3PL parameters from 
these adaptive responses, unless these adaptive responses were 
augmented by additional information, such as prior distributions 
on item parameters or responses from examinees over a broader 
range of ability. 

The solution is not to discard the items in this set from the 
calibration and simply proceed with the two remaining sets of 
items. This is equivalent to discarding the better (more 
discriminating) items available and using only the poorer (less 
discriminating) items in the pool as a scale anchor. Instead it 
appears that one solution could lie in the direction of avoiding 
the direct use of adaptive test responses by summarizing the 
information available from the adaptive test when estimating 
parameters for the new items. Both methods described below move 
in this direction. 
Method A 

For this method, adaptive test item responses and parameter 
estimates for items already in the adaptive test item pool are 
used to compute a maximum likelihood estimate of simulee ability. 
This estimated ability serves as a summary of information 
available from the adaptive test. A LOGIST calibration run is 
then performed in which these ability estimates are fixed, and 
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nonadaptive responses to the new seeded items are used to estimate 
parameters for the 50 candidate new items. The top half of Figure 
4 shows the details of this method across all four Rounds, 
starting with Round 1. 



Insert Figure 4 about here 



Because the ability estimates are fixed in the calibration of 
the new items, and no rescaling is done within the LOGIST run, the 
ability estimates determine the scale on which the item parameter 
estimates are reported. Since the ability estimates themselves 
are on the scale of the adaptive test item pool, the estimates for 
the candidate new items will be also. 
Method B 

This method requires a set of 'anchor' items to be seeded, 
along with the new items. The purpose of the anchor items is to 
try to improve upon Method A. Method A depends entirely upon 
treating estimates of ability as if they were, in fact, true 
abilities in order to maintain the scales of subsequent item 
pools. In so doing, errors will be made because estimated 
abilities differ from true abilities. The anchor items will be 
used to attempt to correct for scale drift that may result from 
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the use of these imperfect ability estimates for scale 
maintenance . 

It seems reasonable to select as anchors items that are 
representative of those in the adaptive test item pool in terms of 
difficulty. This is because scale transformations that are 
derived from the anchor items will be applied to all items to be 
considered as candidates for item pools. In addition, the quality 
of the anchor items, in terms of item information, should be no 
worse than typical items produced for the adaptive test item pool, 
since it makes no sense to develop scale transformations based on 
items of poor quality. 

For purposes of this study, 25 anchor items were defined to 
be exactly like 25 items selected from the original 100- item pool. 
The first step in the selection process was to eliminate from 
further consideration 9 items of the 100- item pool that were 
judged to be poorly fit by the 3PL in Round 0. The remaining 91 
items were grouped on the basis of estimated difficulty into 5 
equal intervals between -2.0 and 2.0. Five items were then 
selected randomly from each interval to be the anchor items. 

Nonadaptive responses to 5 randomly selected new items and 5 
randomly selected anchor items were collected from each simulee. 
Since there are half as many anchor items as there are new items, 
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each new item received about 1500 responses and each anchor item 
received about 3000 responses. The alternative design in which 
each simulee received only 5 seeded items, either new or anchor 
items, requires more simulees before an on-line calibration can 
occur, and was judged inconvenient to implement for the purposes 
of this study. Thus, the doubling of response rate for anchor 
items is not a requirement of this Method, but an artifact of the 
chosen design. 

As in Method A, adaptive test information is summarized in a 
maximum likelihood estimate of simulee ability. A LOGIST 
calibration is done, fixing the ability estimates and estimating 
item parameters for the 50 new items and the 25 anchor items. 
Using the reestimates for the anchor items, and their initial 
estimates from Round 0, a scaling transformation is chosen to 
minimize the difference between the two test characteristic curves 
(Stocking & Lord, 1983). Using this tranf ormation the results of 
this LOGIST run are placed on the scale of the Round 0 item pool. 
The bottom half of Figure 4 shows the details of Method B across 
all four Rounds, beginning with Round 1. 

The Selection of New Items 

Although the calibration of new items at each Round differed 
for the two Methods studied, the same algorithm was used to select 
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from the 50 candidate items the 25 new items to be included in the 
next Round's item pool. The difference between the estimated test 
information function for the Round 0 item pool and the estimated 
test information function for the 75 items to remain from the 
current pool was defined as a 'target' information function. The 
use of this difference as a target insures that items will be 
selected to maximize the resemblance of the next Round's item pool 
to the Round 0 item pool. 

Three methods of selecting item sets or 'drafts' to match the 
target information function were tried. The first method selected 
items to minimize the maximum difference between the target and 
the draft test information functions across the ability metric 
from -3.0 to +3.0. The second method selected items with the 
greatest area under their item information functions within 
ability levels that appeared important based on the target 
information function. A third method was a combination of these 
two: a draft set of 25 items was selected on the basis of the 
area under the item information functions and then attempts were 
made to improve on this draft by discarding some items and 
selecting others that minimized the maximum difference between the 
target and the draft test information functions. 
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None of these methods of selecting replacement items worked 
automatically without intervention; the process can best be 
described as more of an art than a science. The 25 replacement 
items (and the selection method) were ultimately chosen on the 
basis of a subjective criterion: item sets with information 
functions closer to the target over middle ranges of ability were 
preferable to item sets with information functions more distant 
from the target in the middle but closer at the extremes . The 
first method was never judged to produce the best set of 
replacement items; it frequently resulted in the selection of 
items with more extreme difficulties over items with more moderate 
difficulties. This has the effect of matching or exceeding the 
target curve at more extreme ability levels where there are fewer 
siraulees and where the reduction in information over the Round 0 
item pool is least. The second method, selecting on the basis of 
the area under the item information functions between fairly 
narrow limits of ability, was most frequently judged the best, 
that is, the information functions were as close as possible to 
the target curve for middle levels of ability. The results of the 
third method were occasionally selected as well. Both of the more 
desirable methods required tinkering with the ability limits 
within whir h a match to the target was desired. 
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Results, Part 1: The True Score Metric 
In examining the results of on-line calibration by any 
Method, it seems important to first focus attention on global 
effects in a metric that approximates one in which examinee scores 
will be reported. A subsequent section will examine in more 
detail the operation of each Method in the IRT metric. 

The number- right true score metric for the Round 0 pool was 
chosen for this series of comparisons because the Round 0 pool is 
the only item pool that is constant across Methods, and, indeed, 
across all other experimenters in this project. In this context, 
the items in the Round 0 pool may be viewed as a "reference" test 
that is common to all experimenters. A true score on this common 
reference test is the score a simulee would have obtained if 
administered the entire Round 0 pool as a conventional test. 

The conditional RMSE between estimated number-right true 
score and true score, and the conditional bias in estimated 
number-right true score were computed for each Method after each 
Round. True scores were computed using true simulee abilities for 
a Round and the true IRF's for the Round 0 items. Estimated 
number-right true scores were computed using simulee abilities 
estimated from the adaptive test at each Round, and the estimated 
IRF's for the Round 0 items. For small intervals of true score, 
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the square root of the average squared difference between the true 
scores and their estimates is considered to be the conditional 
RMSE function. Similarly, the average difference between the 
estimated and true scores for the same small intervals of true 
score is considered to be the conditional bias function. 

The conditionl RMSE and the conditional bias measures contain 
errors attributable to two sources that, because of the study 
design, cannot be examined separately. Both incorporate any scale 
drift due to the sequence of on-line calibrations. Because the 
ability estimates at each Round are from adaptive tests using 
different item pools , these measures also incorporate any effects 
arising from the changes in the item pools across Rounds. 

Figure 5 compares the conditional RMSE and bias for Method A 
across all Rounds of the study, while Figure 6 displays the same 
information for Method B. The conditional RMSE and bias functions 
are the same for both Methods for the first Round. In these 
Figures, the dashed vertical lines mark the quintiles of the 
distribution of true scores for the final half-round of adaptive 
test simulation. In terms of the conditional RMSE, there is 
little difference between Methods at low and high true score 
levels. For middle true score levels, the conditional RMSE for 
Method A increases with every Round of adaptive testing. For 



9 

ERIC 



Scale Drift 
26 

these same true score levels, the Method B RMSE tends to remain 
about the same after an initial increase from the first to the 
second Round of adaptive testing. For middle true score levels, 
where most of the simulees are located, Method B generally has a 
smaller RMSE than Method A. 



Insert Figures 5 and 6 about here 

The changes in the conditional bias functions appear to be 
systematic, although in different ways, across Rounds for both 
Methods. For Method A, the bias becomes more positive across 
Rounds for true scores slightly above the median and more negative 
for low true scores. For Method B the direction of the bias is 
opposite for middle true scores -- higher levels show more 
negative bias and lower levels show more positive bias across 
Rounds. Method B tends to have smaller absolute bias than Method 
A across all Rounds. 

Figure 7 displays the conditional RMSE and bias for both 
methods after the end of all Rounds of adaptive test simulations. 
This Figure clarifies the comparisons of the two Methods. For low 
and high true scores, the RMSE functions are similar; at middle 
true scores, Method B has smaller RMSE. For the very lowest and 
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highest true scores, the conditional bias for the two Methods is 
similar; for the remaining true score levels, the bias works in 
opposite directions. Overall, the absolute bias is smaller for 
Method B. 



Insert Figure 7 about here 



Results, Part 2: The IRT Metric 
The bias results on the true score metric seen in Figures 5, 
6, and 7, may be viewed as scale drift. To interpret these 
results it seems necessary to explore the behavior of the two 
Methods in detail on the IRT metric. The various features of the 
simulations are complex and difficult to examine in a sequential 
fashion. Unfortunately, any written exposition can only proceed 
sequentially rather than simultaneously. 

Many factors concurrently impact the four Rounds of item pool 
refreshment and the four -and -one -half Rounds of adaptive testing. 
Among these are: 

the method of selecting the initial item pool, 

— the correlations among errors of estimation for the item 
parameter estimates, 
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— the adaptive testing paradigm that selects items for 
administration based on an estimated ability and 
estimated item information, 

the use of ability estimates in place of true abilities, 

— the precision with which new items are calibrated, 

the method of identifying items in the current pool for 

replacement , 

the method of selecting replacement items from the new 

items . 

The effort to understand the interactions among these factors 
s complicated by other factors that are not central to on-line 
alibration, but must be dealt with nevertheless. Among these 
omplicating factors are: 

the necessity of establishing a single IRT scale upon 

which comparisons can be made, 

— the comparison of estimated parametric IRF's with true 
nonparametric counterparts, 

the fact that changes in relevant variables between 

consecutive Rounds may be small, necessitating the 
examination of multiple Rounds simultaneously and across 
two different Methods. 
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The information available from each Method for each Round 
will be described in five different but overlapping analyses. The 
first such analysis deals with what, for lack of a better term, 
may be called a 'snapshot' of a single Round. Included in this 
snapshot are data about 1) the item pool at the beginning of the 
Round, 2) the nature of the ability estimates obtained from using 
this item pool to administer adaptive tests, 3) the nature of the 
errors made when using these ability estimates as if they were 
true abilities, and 4) the estimation of the parameters for the 
candidate new items. 

The remaining analyses will look at many of these same 
features summarized in different ways. Given an item pool, the 
second analysis will explore what happens to ability estimation 
when this pool is used to administer adaptive tests. This 
analysis will take place across Rounds within Methods of on-line 
calibration. The third analysis will investigate what happens to 
the calibration of candidate new items, given that the ability 
estimates are used as if they are true abilities. This analysis 
will also take place across Rounds within Methods. The fourth 
analysis will focus in detail on the differences in the 
calibration of new items between Methods across Rounds, with 
particular attention paid to considering separately the effects of 
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the approximate scaling transformation for Method B. Finally, 
given the rules for elimination of items and the method of 
selecting new items for the adaptive test pool, the fifth analysis 
will explore the impact of these across Rounds and within Methods. 
The Establishment of a Single IRT Metric 

Throughout these discussions, comparisons of estimated 
quantities are made with their true counterparts. This, of 
course, is one of the advantages of a simulation study. Before 
this comparison can be made, however, true and estimated 
quantities must be on the same (arbitrary) metric. 

It is unlikely that the metric on which the true item 
response functions and abilities were developed by Levine bears 
any simple linear relationship with the metric of LOGIST 
estimates. However, the assumption is made here that a linear 
transformation can be developed that will be a good approximation 
to a possibly more complex transformation. From the simulated 
initial calibration of the Round 0 item pool we have abilities 
estimated by LOGIST. Davis provided the corresponding true 
abilities for the N = 6000 simulees. Robust measures of location 
and scale (Mosteller & Tukey, 1977, p. 20) were computed for both 
the estimated and true abilities and a linear transformation 
developed from them to transform true abilities to the scale of 
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the Round 0 calibration. This same transformation was applied to 
the true abilities of all other simulees in the remaining Rounds 
of the study. 

All true item response functions were transformed to the 
Round 0 scale by applying this approximate linear transformation 
to Levine's tabled ability values. Figure 8 shows the transformed 
true item response functions (solid lines) and the LOGIST 
estimates of the same item response functions (dotted lines) on 
the Round 0 metric for some typical items. All subsequent 
comparisons with true values were done using the transformed true 
item response functions and abilities. 



Some Results From the Bivariate Normal Distribution 

Some aspects of the subsequent discussions, particularly 
those dealing with true and estimated abilities, may at first 
appear confusing. This section attempts to deal in advance with 
these aspects by removing them temporarily from the context of on- 
line calibration, and placing them in the more familiar context of 
the bivariate normal distribution. The point of the exercise is 
not to imply that the joint distribution of estimated and true 



Insert Figure 8 about here 
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abilities is bivariate normal, but rather to clarify some of the 
techniques used in subsequent analyses . 

Suppose we have two variables whose joint distribution is 
bivariate normal, z^ and z^. Suppose that both variables have 
means of 0 and variances of 1, and that the correlation between 
the two is rho < 1. A scatterplot of 400 random draws from such a 
distribution, restricted to the range of -1 to +1, is shown in the 
left panel of Figure 9. On the same panel, the solid line is the 
45-degree line of z^ = z^. The line with long dashes is the 
regression of z^ on z^, namely z^ = rho * z^. The line with the 
short dashes is the regression of z^ on z^ , z^ = rho * z^ , or z^ - 
(1/rho) * z, . 



Insert Figure 9 about here 



Consider the difference (z^ - z^) . The expectation of this 
residual, conditional on z^, is 

E(z;> - z 1 |z 1 ) = E(z 2 |z ][ ) - z^ ~ rho * z^ - z^ = (rho - 1) * z ][ . 



This function is plotted as a function of z^ in the middle panel 
of Figure 9. Because rho is less than 1, this function has a 
negative slope. Suppose that z 1 is an unobservable variable, and 
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the Z£ is an observable estimate of it. The average value of this 
residual for different values of the unobservable variable is, by 
definition, a conditional bias function. 

Consider the expectation of this same residual, now 
conditional on z^: 

E(z 2 - z 1 |z 2 ) = z 2 - E(z 1 |z 2 ) = z 2 - rho * z^ = (I - rho) * z^. 

This function is plotted as a function of z 2 in the right panel of 
Figure 9. Because rho is less than 1, this function has a 
positive slope. We again suppose that z^ is an unobservable 
variable, and z 2 is an observable estimate of it. The average 
value of this residual for different values of the observable 
variable can be considered to be a conditional error function. 
That is to say, it gives the average of the errors that will be 
made if the observable variable is used as if it were the 
unobservable variable, for different values of the observable 
variable . 

In subsequent sections, the joint distributions of estimated 
and true ability will be analyzed in both ways. The expected 
residuals conditional on the true abilities will constitute an 
examination of the bias in the estimated abilities. These 
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conditional bias functions will usually have a negative slope. 
The expected residuals conditional on the estimated abilities will 
constitute an examination of the error made when using estimated 
abilities rather than true abilities. These conditional error 
functions will usually have a positive slope. 
Analysis 1: Snapshots 

Figures 10 for Method A and 11 for Method B display the 
snapshots of each Round for each Method. Each subfigure (labeled 
10a, 10b,. . . , or 11a, lib,. . . ) of each Figure shows a single 
Round and is intended to be read left to right and top to bottom 
as in a page of printed text. The following section discribes the 
derivation and interpretation of each panel of a subfigure. 
Subsequent sections compare each Round of the two Methods of on- 
line calibration. 



Insert Figures 10a- lOe and 11a- lie about here 



The Panels of a Subfigure 

The first two panels in the top row of a subfigure display 
information about the. item pool with which the Round is begun. 
The first panel shows the differences between the estimated and 
true slopes, plotted as a function of the true slopes; the second 
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panel shows the differences between the estimated and true 
difficulties, plotted as a function of the true difficulties. In 
the second panel, the difficulty of an item for which the slope 
has been overestimated is indicated by a plus, and the difficulty 
of an item for which the slope has been underestimated is 
indicated by a circle. 

In these and all other figures containing item parameter 
information, the 'true' difficulty is defined as that value on the 
true (transformed) ability metric that yields the same probability 
of a correct response as does the estimated difficulty on the 
estimated item response function. The 'true' slopes are computed 
by numerical methods (Hamming, 1962. p. 318) at the location of 
the true difficulty. Thi"- method of obtaining 'true' difficulty 
and slope parameters for the nonparametric true IRF's is flawed. 
Consider an item represented by a particular true nonparametric 
IRF. If we have two different estimates of the difficulty 
parameter for this item, it is possible that this procedure will 
produce two different values of the 'true' difficulty parameter, 
and also two different values of the 'true' slope parameter. To 
examine the magnitude of this problem, the 50 new items calibrated 
in Round 4 were examined. These items are the same for each 
Method, and the difference between the parameter estimates, and 
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therefore between the 'true' parameters determined by this Method, 
should be largest on this final Round. The average difference 
between the 'true' difficulties was found to be .02, and the 
average difference between the 'true' slopes was found to be .004. 
These differences are small. 

This method has an additional disadvantage. In this and 
subsequent figures, not all relevant items may appear. The 
nonparametr ic item response function is described by tabled 
function values over a finite interval of ability, and 
occasionally it is not possible to find the 'true' difficulty and 
slope values within this table. Appendix 1 contains a list, by 
Figure, of all items that should, but do not, appear in plots of 
this nature throughout this paper, along with the values of their 
estimated item parameters. 

In spite of these flaws, this method of determining 'true' 
parameters from nonparametr ic item response functions, although 
crude, seems to to work well enough to give useful insights into 
differences between the two Methods of on-line calibration. 

The third panel in the top row of each snapshot displays the 
median difference between estimated and true ability as a function 
of the median true ability for small intervals of true ability. 
The median difference rather than the average difference is used 
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because some estimated abilities may be outlying values that 
distort averages. This panel may be viewed as a plot of an 
approximate bias function for estimated ability resulting from the 
use of this item pool to administer an adaptive test. 

The first panel on the bottom row of each snapshot displays 
the median difference between estimated and true ability as a 
function of the median estimated ability for small intervals of 
estimated ability. This may be viewed as approximating the errors 
made when using an estimated ability as if it were a true ability. 
For example, in Figure 10a, we see that when we use an estimated 
ability of 2., that this estimated ability is larger than the true 
ability by about .1; when we use an estimated ability of -3, this 
estimated ability is lower than the true ability by about .8. 

The final two panels on the bottom row of each snapshot 
display the residual plots for the estimates of the new items in a 
Round. The estimation of these item parameters is accomplished 
using the estimated abilities as if they were true abilities. 
Round 1 Snapshots 

The first Round of the simulation is represented in Figure 
10a for Method A and Figure 11a for Method B. In these two 
figures, the starting item pool for this Round is the same -- the 
Round 0 pool. Likewise, the bias function and error function for 
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the abilities are the same. Most of the items in the Round 0 pool 
have over estimated slopes (first panel in top row). Items with 
middle difficulty are better estimated than items with more 
extreme difficulties (second panel in top row) . This is because 
these item parameter estimates were obtained from a group of 
simulees with a typical (bell-shaped) distribution of abilities. 
Easier items tend to have overestimated difficulties, and also 
overestimated slopes. Harder items are also less well estimated 
than those of middle difficulty. In addition, overestimated 
harder items tend to have underestimated slopes while 
underestimated harder items tend to have overestimated slopes. 
These relationships between over- and underestimated slopes and 
over- and under estimated difficulties are consistent with the 
Wingersky and Lord (1984) result of a positive correlation between 
errors of estimation for item discriminations and difficulties for 
easy items and a negative correlation for hard items. These 
correlations are perhaps emphasized here because the distribution 
of ability in the sample of simulees was not broad enough to 
obtain accurate parameter estimates for the more extreme items. 

Adaptive tests were administered to simulees using this Round 
0 item pool. An ability estimate was first determined, based on 
responses to items previously administered. During the actual 
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operation of the adaptive test, this ability estimate was computed 
using Bayesian methods developed by Owen (1975) . For both Methods 
of on-line calibration, a maximum likelihood ability estimate, 
rather than Owen's Bayesian ability estimate, was used. Both the 
Owen's Bayesian ability estimate and the maximum likelihood 
ability estimate weight items with higher estimated 
discriminations more than items with lower estimated 
discriminations . 

In the adaptive test, items that were maximally informative 
at the estimated ability were then selected for administration. 
The information for an item is itself an estimate, and will be an 
overestimate of true information for items with overestimated 
slopes. For items of approximately the same difficulty, the items 
with the most overestimated slopes will be chosen more frequently 
than items with less overestimated slopes. Because of the 
correlation between errors of estimation for difficulty and 
discrimination, easy items with overestimated difficulties will be 
chosen more frequently than easy items with less overestimated 
difficulties. Likewise harder items with underestimated 
difficulties will be chosen more frequently than harder items with 
less underestimated difficulties. The missestimation of 
difficulties will be emphasized by methods of estimating ability 
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that weight items with higher estimated slopes more than items 
with lower estimated slopes. 

We can expect these speculations to be confirmed by the bias 
plot of estimated abilities. Indeed, the bias in the ability 
estimates (third panel of top row in Figures 10a and 11a) does 
follow the same pattern as that of the difficulties of the items 
with overestimated slopes (middle panel of top row in Figures 10a 
and 11a) . 

In the next step of Round 1 for both Methods, the ability 
estimates are used as if they were true abilities and the item 
parameters are estimated for the candidate new items. We know 
that estimated abilities have a larger variance that true 
abilities. Because the unit of measurement of the IRT scale is 
usually taken to be the standard deviation of abilities, this 
difference in variance may be viewed as a difference in scale. 
Table 2 displays the variance of the true and estimated abilities 
for all samples of simulees for all Rounds. While the differences 
in the Round 1 variances may seem quite small, a better picture of 
the nature of the errors incurred is found in the error function 
panel (first panel of bottom row in Figures 10a and 11a). Higher 
estimated abilities are systematically overestimates of their 
corresponding true values, and the difference becomes larger as 
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the estimated ability increases. The same is true in the opposite 
direction for lower estimated abilities. 



Insert Table 2 about here 



Up to this point in Round 1, the two Methods of on-line 
calibration have been identical. It is in the calibration of the 
new items that the Methods begin to differ. The spreading out of 
the estimated abilities and their subsequent use as if they were 
true abilities causes the slopes to be generally underestimated 
for the new items for Method A. (middle panel, bottom row, in 
Figure 10a) . The difficulties for the new items are not as well 
estimated as the difficulties for the Rend 0 pool; the scatter of 
the residuals is greater. In addition the trend towards the 
overestimation of difficulty for harder items and underestimation 
of difficulty for easy items is consistent with the direction 
predicted on the basis of correlation between estimation errors 
for items with generally underestimated slopes. 

The calibration of new items for Method B (second and third 
panels, bottom row, Figure 11a) is less affected by the use of 
estimated abilities as if they were true abilities. About half of 
the items have overestimated slopes. The difficulties for the 
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items with overestimated slopes are missestimated in a pattern 
consistent with the correlation of estimation errors. The 
difficulties for the items with underestimated slopes seem less 
affected. The items are the same for both Methods and the 
parameters are estimated from the same set of estimated abilities. 
What is different is the scaling transformation that is performed 
for Method B, using information from the anchor items. This has 
the effect of an approximate correction for the fact that using 
the estimated abilities to obtain parameter estimates is not the 
same as using true abilities. From this point on, the two Methods 
of on-line calibration will differ, and the differences will 
become more notable with each Round. 
Round 2 Snapshots 

Figure 10b for Method A and Figure lib for Method B display 
the second Round of simulation results. Items have been discarded 
from the Round 0 pool on the basis of the frequncy of use in 
adaptive test simulations. Because the simulees are a typical 
group, the most frequently used items were the discriminating 
items of middle difficulty. Replacements were selected from the 
new items calibrated in the previous Round (Figures 10a and 11a, 
bottom row, third panel) . These items spanned a broad range of 
difficulty. All items of middle difficulty were selected as 
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replacements, as well as more extreme items with higher estimated 
slopes . 

The Method A pool for Round 2 contains more items with 
underestimated slopes than the Method B pool. .The well estimated 
middle difficulty items from Round 0 have been eliminated for both 
Methods, and replaced by items whose difficulties are less well 
estimated. For Method B, and also, but less clearly, for Method 
A, the pattern of residuals for the item difficulties for items 
with overestimated slopes is consistent with the correlation of 
estimation errors; the trend is less clear for items with 
underestimated slopes. As in Round 0, the bias in the estimated 
abilities follows the bias in item difficulties for those items 
with overestimated slopes. 

For both Methods, the variance of estimated abilities is 
greater than the variance of true abilities, although this 
difference is smaller than the differences for Round 1, as shown 
in Table 2. But, also for both Methods, the pattern of errors 
made when using estimates as if they were true shows systematic 
errors are now made, even for middle levels of estimated ability. 
It seems probable that this phenomenon is the result of the fact 
that the item difficulties are less well estimated for both 
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Methods for middle difficulty items as well as the patterns of 
missestimation of slopes and difficulties peculiar to a Method. 

For Method A, estimated abilities just above the middle are 
overestimates, while estimated abilities just below the middle are 
underestimates. Method B shows the same error pattern for 
estimated abilities just above the middle, and a slightly less 
severe underestimation for estimated abilities just below the 
middle. For Method A, then, the estimated abilities are more 
spread out than for Method B, even for middle ability estimates. 
This difference in patterns is a consequence of the 
underestimation of slopes for almost all new items selected for 
inclusion in this Round 2 pool by Method A. 

When using these estimated abilities to estimate parameters 
for new items, as before the slopes for Method A are mostly 
underestimated while those for Method B are less so. The 
correlation of estimation errors is now clearly visible for Method 
A for those items with underestimated slopes. It is also clearly 
visible for Method B for those items with overestimated slopes. 
Snapshots for Rounds 3, 4, and 5 

The remaining subfigures of Figures 10 and 11 show the 
results for the remaining Rounds. All of the phenomena examined 
so far appear and become more exaggerated as the Rounds progress : 
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1) The starting pool for a Round for Method A always contains 
more items with underestimated slopes than the pool for Method B. 
As the Rounds progress, the slopes for the Method A pools become 
mostly underestimated. For Method B, the lumbers of over- and 
underestimated slopes become about the same. 

2) The middle difficulty items are the items that are 
continually replaced for both Methods. For Method A, it becomes 
more visible that the estimates of these difficulties follow 
patterns predicted by the correlated estimation errors for items 
with underestimated slopes. Patterns for overestimated slopes are 
less discernable for this Method, presumably because there are so 
few items with overestimated slopes. For Method B, the estimates 
of the difficulties for the new items with overestimated slopes 
follow patterns predicted by the correlated estimation errors. 

The prediction of patterns for items with underestimated slopes is 
not very visible. 

3) For middle levels of ability, the bias in the ability 
estimates follows the bias in the difficulties of the items that 
are most used in the adaptive test. For Method B, this tends to 
be the middle difficulty items with overestimated slopes. For 
Method A, this tends to be the middle difficulty items with 
underestimated slopes, simply because there are few, if any, 
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Method A items calibrated on-line with overestimated slopes. 

These biases tend to be in opposite directions, predictable on the 

basis of the correlated estimation errors when the items were 

calibrated on-line. 

4) Table 2 shows that, across Rounds, the estimated abilities 

are more spread out than the true abilities. This difference is 
generally larger for Method A than for Method B, and for Method A 
it increases across Rounds. The Figures show that for middle 
levels of estimated ability, the errors made when using estimated 
ability as if it were true become increasingly large in opposite 
directions around the middle for Method A. They remain about the 
same for Method B. This reflects fact that the slopes of items 
for Method A are consistently underestimated, as well as the 
missestimation of middle item difficulties for both Methods. 
Analysis 2: Item Pools and Bias in Ability Estimates 

This analysis is offered as an aid in understanding a single 
aspect of the data presented in the snapshots of each Round. 
Figure 12 for Method A and Figure 13 for Method B display for each 
Round the item pool with which the Round is begun, and the 
approximate bias functions for the estimated abilities that result 
when the item pool is used in adaptive testing. These Figures are 
constructed by lining up the top row of three panels from each 
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subfigure in Figures 10 and 11. Each row in the Figures 12 and 13 
represents a Round in the simulation. The top rows in the two 
Figures are the same; they both display the Round 0 item pool and 
the bias functions for the first Round of the simulation. 
Although no new information is presented here, it may be easier to 
comprehend the trends across Rounds when the panels are arranged 
in this manner. 



Insert Figures 12 and 13 about here 



Looking down the left hand column in Figure 12, it is easy to 
see that, for Method A, the point cloud representing the residuals 
of the slopes is mostly above the horizontal line in Round 1, 
gradually drifts downward, and is mostly below the horizontal line 
in the final Round. For Method B in Figure 13, this drift seems 
to stabilize at about the point where the slopes are evenly over 
and underestimated. 

The middle column of panels in the two Figures shows that the 
changes in the residuals of the difficulties are predominantly for 
middle difficulty items. These are the items that get used most 
frequently, and therefore replaced most frequently. For Method A, 
these items become overestimated if they are slightly above the 



ERIC 



5x 



Scale Drift 
48 

middle and underestimated if they are slightly below the middle. 
Most of these items have underestimated slopes. For Method B, 
these middle difficulty items have slopes that are both over- and 
underestimated. 

The third column of panels in the two Figures shows that the 
bias in the estimated abilities tends to remain the same for 
extreme abilities, regardless of the Method of on-line calibration 
or the particular Round. The bias of middle estimated abilities 
changes because the middle difficulty items are being replaced. 
For Method B, the bias tends to look like the bias in the 
difficulties for items of middle difficulty that have 
overestimated slopes. This is because it is just those items that 
are selected most frequently and weighted most heavily for 
simulees of middle ability levels in adaptive testing. For Method 
A this bias also tends to look like the bias for items of middle 
difficulty, because these are the only items available for 
simulees of middle ability levels. Most, if not all, of these 
items have underestimated slopes. 

Analysis 3: Errors in Ability Estimates and the Calibration of 
New Items 

This analysis is offered as an aid to understanding a second 
aspect of the data presented in the snapshots of each Round. 
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Figure 14 for Method A and Figure 15 for Method B display the 
approximate error functions when using estimated abilities as if 
they were true abilities, and the resulting residual plots for the 
new items calibrated from these estimated abilities. These 
Figures are constructed by lining up the bottom three panels from 
each snapshot subfigure across Rounds. The left panel in the top 
row of each Figure is the same; it displays the error function for 
Round 0. All remaining panels differ. 



Insert Figures 14 and 15 about here 

For Method A, the left column of panels in Figure 14 shows 
that the estimated abilities are increasingly spread out across 
Rounds. This is confirmed by Table 2. The middle column of 
panels shows chat more new items have underestimated slopes across 
Rounds; this is because the abilities become more spread out. The 
right panels show the correlation between estimation errors for 
items with underestimated slopes becomes increasingly pronounced 
across Rounds. This is not surprising in view of the greater 
number of items with underestimated slopes. 

For Method B, Table 2 and the left column of panels in Figure 
15 show that the spread in estimated abilities does not change 
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much across Rounds. The middle column of panels shows that the 
slopes for the new items are about evenly over- and 
underestimated. The right column of panels shows that correlation 
between errors of estimation for items with overestimated slopes 
becomes more visible across Rounds. 

It is possible for individual item parameter estimates to be 
different but the estimated item response functions produced by 
these estimates to be similar. This is particularly true for very 
easy and very hard items, where individual item parameter 
estimates are not well determined; quite different parameter 
estimates can give similar item response functions in the area of 
the ability distribution where simulees are located. It would be 
instructive to compare estimated item response functions for both 
Methods across Rounds. Unfortunately, only a single item (Vale 
number 244) was included in the Round 0 pool and was also included 
in the 50 candidate new items for each Round for each Method. 
This item was a discriminating item with higher than average 
difficulty. Figures 16 (Method A) and 17 (Method B) show the true 
item response function (solid line) and the estimates of this item 
response function across all Rounds. Method A estimates show more 
variability, both in terms of slope and location, than do Method B 
estimates. Method A estimated slopes tend to be too low and 
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estimated difficulties too high. Method B estimated slopes tend 
to be too high and estimated difficulties too low. 



Insert Figures 16 and 17 about here 



While it is not possible to directly compare the estimated 
item response functions for any other item across Rounds and 
Methods, it is possible to approach issues of the accuracy of 
estimation more indirectly. Using the true abilities at a Round 
as weights, a weighted RMSE becween the estimated and true item 
response functions was computed. At each Round, an average 
weighted RMSE was then computed, where the average was taken over 
the new items calibrated in that Round. Table 3 shows these 
average weighted RMSE' s for all Rounds for both Methods. The 
Round 0 average weighted RMSE is the same for both Methods since 
the items and estimates of item parameters are identical. The 
average weighted RMSE for Method A increases across subsequent 
Rounds, while the average RMSE for Method B remains approximately 
the same . 



Insert Table 3 about here 
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Analysis 4: The Calibration of New Items and the Approximate 
Scaling Transformation 

The Round 0 item pool is, by design, identical for both 
Methods of on-line calibration. After the first Round the actual 
items included in a pool can be different for the two Methods, 
also by design. The differences are introduced by the elimination 
of overused items and the selection of the 25 new items to be 
included in the next item pool, processes that operated 
independently for the two Methods. In actuality, all items of the 
100 in the final pool created for Round 1 were the same, 96 out of 
100 were the same for Round 2, 95 were the same for Round 3, and 
90 were the same for Round 4. Since the pools are so similar for 
the two Methods, the differences between the two Methods appear to 
originate with the approximate scaling transformation developed 
from the anchor items for Method B. This analysis will examine 
the effects of this rescaling in more detail. 

The multiplicative and additive constants of each rescaling 
are shown in Table 4. All multiplicative constants are less then 
one, indicating that at each Round, the scale unit before 
transformation is too large. All additive constants are non-zero, 
indicating that the scale origin before transformation is not 
precisely correct. The importance to attach to the differences 



ERIC 



Scale Drift 
53 

among the multiplicative constants or the differences among the 
additive constants is unknown but could be investigated through a 
jackknife study. 



Insert Table 4 about here 



Note that the rescalings depend upon one another in a complex 
and indirect fashion. The rescaling for Round 1 is applied to the 
new items calibrated in that Round, of which 25 are selected to be 
included in the second item pool. To the extent that abilities in 
the second Round of adaptive testing are estimated using the items 
that have been calibrated on-line in Round 1, the calibration of 
the new items in the second Round will be affected by the 
rescaling from the first Round. Since the 25 items selected are 
of middle difficulty, it is certain that some abilities are 
estimated from some items that have been calibrated on-line. This 
same type of dependency propagates across Rounds . 

Figure 18 shows the residuals for the estimated slopes and 
difficulties of the new items calibrated in each Round by each 
Method. Each subfigure (18a, 18b, etc.) shows a different Round, 
and within a Round, the items calibrated are the same for each 
Method. In each subfigure, the panels in the top row show the 
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results for Method A. The panels in the middle row show the 
results for Method B before the application of the approximate 
scaling transformation. The panels in the bottom row show the 
results for Method B after the application of the approximate 
scaling transformation. 

For Round 1 (Figure 18a), Method B before transformation 
(middle row) is identical to Method A (top row), that is, the 
items are the same and the estimated parameters are also the same. 
As seen before, the slopes are generally underestimated. Also as 
seen before, the underestimation of difficulties for easy items 
with underestimated slopes and the overestimation of difficulties 
for hard items with underestimated slopes is apparent. The 
sampling correlation for items with overestimated slopes is not 
apparent. The effect of the transformation (bottom row) is to 
reduce the underestimation of the slopes. The overestimation of 
easy items with overestimated slopes and the underestimation of 
hard items with overestimated slopes is now visible, while the 
sampling correlate or items with underestimated slopes 
disappears . 



Insert Figures 18a -18d about he-re 
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For Round 2 (Figure 18b) and all subsequent Rounds, Method B 
before transformation (middle row) is no longer identical to 
Method A (top row), that is, the items are the same for a Round, 
but the parameter estimates are different. This is because the 
estimated abilities used to calibrate the new items have been 
affected by the different parameter estimates for the items 
included in the final pool from the previous Round. Across 
Rounds, Method B untransf ormed slopes generally have less scatter 
than Method A slopes, although they are generally underestimated, 
as are the Method A slopes. Method B untransf ormed difficulties 
also have less scatter than Method A difficulties, and as for 
Method A, the sampling correlation for items with underestimated 
slopes is visible. 

Across Rounds, the application of the scaling trans format ion 
has the effect of raising slopes, so that there are about equal 
numbers of slopes that are over- and under -estimated. 
Simultaneously, the correlation between errors of estimation for 
slopes and difficulties for items with overestimated slopes is 
enhanced, and the correlation between errors of estimation for 
slopes and difficulties for items with underestimated slopes is 
suppressed . 
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The middle panels of Figures 18b, 18c, and 18d show the 
estimation of parameters for the new items for Method B is 
improved when compared to Method A even before the scaling 
transformation is applied for the current Round. This improvement 
can only occur indirectly through the improvement of the ability 
estimates (derived from the adaptive tests that used the previous 
Round's final item pool) used to calibrate these new items. An 
alternative method of examining the improvement is shown in Table 
3. In parentheses, the average RMSE for the new items in each 
Round for Method B before the application of the scaling 
transformation is shown. The RMSE's are smaller than those for 
Method A, and also smaller than those for Method B after 
transformation. The scaling transformation has the effect of 
reducing the bias in the estimated item parameters, at the expense 
of increasing the overall RMSE. But the RMSE after transformation 
for Method B is still smaller than the RMSE for Method A. 
Analysis 5: The Impact of Elimination and Selection Rules on Item 
Pools 

The snapshots of each Round in Figures 10 and 11 show that 
cycles of on-line calibration and item pool refreshment have most 
impact on items of middle difficulty in the item pool. The 
analysis presented here examines these results from a more global 
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perspective in terms of the estimated test information functions 
for the 100- item pools developed during each Round of the study. 

The estimated test information function (Lord, 1980, equation 
5-6) of a set of items is the maximum amount of information that 
can be obtained from the item set if it were administered as a 
conventional test. It is not the information function for an 
adaptive test using this item set as the item pool. The adaptive 
test information function for algorithms of the type used here can 
only be conveniently estimated from a numerical approximation 
using Monte Carlo results (see, for example, Lord, 1980, section 
10.6). This latter information function depends upon not only the 
item pool, but also the algorithms incorporated in the adaptive 
test that deal with the selection of the first item, randomization 
to increase item security, the selection of the next item to be 
administered, the stopping rule, and the scoring method. In this 
discussion, the escimated test information function will be viewed 
as a convenient mechanism for examining the information structure 
of the item pool upon which the adaptive testing algorithm will 
operate . 

Figure 19 for Method A and Figure 20 for Method C display the 
estimated test information functions for the 100-item pools 
developed during each Round of the study. The estimated test 
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information function for Round 0 is the same for both Figures. 
For both Methods, the information in the pool for middle levels of 
ability declines from Round 0 to Round 1; subsequent declines are 
less severe. Across Rounds, the information at low ability levels 
does not change very much and actually increases over that for 
Round 0 for some items. The information at high ability levels 
continues to decline, particularly for Method A, although it 
actually increases somewhat for the very highest ability levels. 

Insert Figures 19 and 20 about here 



These results are due to the interaction of a number 
circumstances that are visible in the snapshots of Figures 10 and 
11. First, the method of building the Round 0 pool makes it look 
better than it actually was, since the items were selected on the 
basis of estimated information. This estimated information came 
from parameter estimates derived from a LOGIST calibration of 
Vale's original response data, rather than the Levine -based 
simulated data used to obtain parameter estimates in Round 0. To 
the extent that Levine' s true IRF's and ability distribution are 
adequate representations of.' the phenomena generating the Vale 
data, we would expect the calibrations in Round 0 to be similar to 
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the calibrations based on the Vale data. Indeed, the first panel 
of Figure 10a or 11a show that most Round 0 slopes are 
overestimated. 

Second, the decline in estimated test information for middle 
levels of ability is also a consequence of the rule for 
eliminating items at each Round. In adaptive testing with a 
typical group, it is the informative middle difficulty items that 
will be selected most frequently for administration. Since items 
are eliminated on the basis of frequency of use, these items will 
be the first to be removed as Ihe process goes from Round 0 to 
Round 1. This phenomenon was apparent in the snapshots. Since 
frequency data is accumulated across Rounds, informative easy and 
difficult items may be removed in subsequent Rounds. 

Replacement items are selected on the basis of estimated 
information from a set of 50 items that have been randomly 
selected from the original Vale pool of 258 items. These items, 
as seen in the snapshots, have a broad range of difficulty. For 
both Methods, the estimated difficulty for items of true middle 
difficulty are less well estimated than Round 0 items of middle 
difficulty. For Method A, their slopes are generally 
underestimated, while for Method B, the slopes are about evenly 
over- and underestimated. Since replacement items are selected on 
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the basis of estimated information, replacement items selected for 
Method A will have lower estimated information than replacement 
items selected for Method B. 

Figure 21 for Method A and Figure 22 for Method B show the 
estimated test information functions for the target curve, the 50 
new items, and the final selection of 25 replacement items for 
Round 4. These results are typical of other Rounds. A random 
selection from Vale's items cannot provide adequate numbers of 
informative middle difficulty items as possible replacements. For 
both Methods, the 25 replacement items have as much information as 
possible at middle levels of ability, given the inadequacy of the 
50- item set from which they are selected. The decline in 
estimated test information in Figures 19 and 20 is less for easier 

items because the Vale item set contains proportionally more easy 

items rather than hard items. 



Insert Figures 21 and 22 about here 

For both Methods, about 30% of the Round 4 item pool consists 
of items remaining from the Round 0 pool, all with estimated 
difficulties greater than 1.0 in absolute value. These items have 
been available for administration to N - 60000 simulees by the end 
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of Rcund 4, but have not accumulated sufficient responses to be 
among the 25 most used items at any Round. A different but 
overlapping 30% of the Round 4 item pool for each Method consists 
of items with fewer than 1000 (and sometimes no) responses. Most 
of these items have estimated difficulties greater than 1.5 in 
absolute value. 

Discussion 

Of the two methods of on-line calibration studied here, 
Method B would be more expensive to implement than Method A. More 
items must be obtained initially from which to select not only the 
first adaptive test item pool, but also the set of anchor items. 
Because the anchor items must also be seeded, either each examinee 
must receive more seeded items, or more examinees are required to 
receive seeded anchor items before an on-line calibration can 
occur. In any practical application of this Method, the anchor 
items themselves must also be replaced from time to time, 
requiring, at every stage, a larger number of items to be 
considered as candidate new items. 

Although Method B is more expensive than Method A to 
implement, some of its properties are more desirable than Method A 
properties. The evidence shows that relying on ability estimates 
alone to maintain the scale across Rounds in on-line calibration 
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leads to increasingly inaccurate estimation of the parameters of 
the new items, which in turn, leads to increasingly inaccurate 
ability estimates (Figures 14 and 18) . Even the approximate 
rescaling used here with the anchor items improves scale 
maintenance (Figures 15 and 18) . The conditional RMSE on the true 
score metric of the Round 0 item pool (Figures 5 and 6) shows 
these effects on a metric that approximates what might actually be 
used for the CAT-ASVAB. Both Methods show an initial increase in 
RMSE af t. r the first item pool refreshment. However, the RMSE for 
Method B remains about the same for subsequent Rounds, while that 
for Method A increases, particularly for middle true scores, 
across Rounds . 

Both LOGIST-based Methods show scale drift over four Rounds 
in which about 70% of the original item pool is replaced (Figure 
7). The drift is in opposite directions for the two Methods. 
Method A tends to underestimate slopes because estimated abilities 
are more spread out than those of Method B. Because of the 
correlation between errors of estimation, Method A underestimates 
difficulties for easier items and overestimates difficulties for 
harder items (Figures 14 and 18) . Estimated abilities reflect 
this bias in the difficulties (Figure 12), as do estimated true 
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scores (Figure 5) . Thus the bias is positive for moderately high 
true scores and negative for moderately low true scores. 

Method B tends to estimate slopes better because abilities 
are better estimated since the items used to compute the abilities 
have been subjected to the approximate scaling transformation 
developed from the anchor items. However, this Method 
overestimates more slopes, and because of the correlation between 
errors of estimation, this Method tends to overestimate 
difficulties for easier items and underestimate difficulties for 
harder items (Figures 15 and 18). Estimated abilities reflect 
this bias in difficulties (Figure 13) , as do estimated true scores 
(Figure 6). Thus the bias is negative for moderately high true 
scores and positive for moderately low true scores. 

Based on their respective biases at the lower ranges of true 
scores, Method A might exclude too many minimally qualified 
recruits; it underestimates their scores. Method B might include 
too many minimally qualified recruits; it overestimates their 
scores. At upper true score ranges, Method A gives an overly 
optimistic reporting of scores, while Method B gives an overly 
pessimistic reporting. 

The causes for this scale drift are known and seem in large 
part attributable to the correlation among errors of estimation. 
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However, these effects appear to be exaggerated by the currently 
accepted procedures for adaptive testing. These correlations are 
not the result of the particular estimation method used here 
(LOGIST) , but are common to all item response models and 
associated estimation methods that assume that an item response 
function is a member of a family of symmetric curves. They become 
important in applications of IRT that rest heavily on the veracity 
of item level data, such as adaptive testing. 

The problems begin with the inadequate calibration of the 
items from which the initial item pool are to be selected. By 
design, this is typically an unusual set of items that are very 
discriminating, have a wide range of difficulty, and low guessing 
parameters (see Table 1). Item parameter estima. as for this 
collection of items are usually obtained by administering these 
items to a typical group of examinees. This sample of examinees 
is inadequate for good estimation of the more extreme items. Thus 
the estimation errors, and their correlations, are exaggerated for 
just those items that are important in building an adaptive test 
item pool. 

The problems are magnified by the accepted method of building 
an initial adaptive test item pool by selecting items on the basis 
of estimated information. This guarantees that the items with the 
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most overestimated discriminations and missestimated difficulties 
will be selected. Figure 10a or Figure 11a show that the 
correlations between estimation errors are already present in the 
Round 0 item pool. 

Further damage is done when items are chosen for 
administration in an adaptive test on the basis of estimated 
information. Because of the correlation between errors of 
estimation, the items with the most overestimated discriminations 
and therefore most missestimated difficulties will be chosen, 
insuring that the difficulties will be, on average, inappropriate 
for the examinee. This missestimation is already present when 
using the Round 0 item pool for adaptive testing, as seen in 
Figures 10a or 11a. Scoring an adaptive test with an algorithm 
that weights items with high estimated discrimination more than 
items with low estimated discrimination must further magnify this 
missestimation of abilities. Both the maximum likelihood 
estimator used in the on-line calibrations and the Bayesian modal 
estimator used in the adaptive test, administration have this 
property . 

Abilities estimated from adaptive test administrations to a 
typical group are biased because of the biases in the estimates of 
item difficulties (Figures 12 and 13), and contain more error than 
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the true abilities (Figures 14 and 15) . These estimates are used 
to determine the parameters of the new items. The new items are 
unscreened, and typically cover a wide range of difficulty. Only 
a few of these items are effective replacements for items that are 
eliminated (Figures 21 and 22). If no scale adjustments are made, 
item discriminations become underestimated, with predictable 
missestimation of difficulties (Figure 18). If an approximate 
rescaling is employed, discriminations are better estimated, but 
also with predictable missestimation of difficulties (also Figure 
18) . If items are selected for inclusion in the pool on the basis 
of estimated information, the correlation of errors of estimation 
is exaggerated once again. Over subsequent cycles of on-line 
calibration, these problems continue to grow. 

The following suggestions for improving the process seek to 
retain the spirit of conventional wisdom in adaptive testing, 
although an unthinking application of this conventional wisdom is 
appropriate only when true item parameters and abilities are know. 
It is possible that some of these suggestions might mitigate the 
inevitable scale drift in on-lir.e calibration. As always, there 
seems to be a choice of mitigating bias by giving up precision, or 
increasing test length. 
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Suggestion #1 

Any method of on-line calibration should begin by obtaining 
the best possible estimates of item parameters for the items from 
which the item pool will be selected. The current design could be 
improved by the use of a more appropriate sample of examinees. If 
the items have a wide range of difficulties, a sample with a wide 
range of abilities is required for the adequate calibration of the 
more extreme items. When employing a LOGIST-based approach to on- 
line calibration, it may be possible to correct these estimates 
for bias, at least approximately, using formulas appropriate for 
the simultaneous estimation of item and ability parameters. 
Formulas for this situation do not exist, but could possibly be 
derived using methods similar to Lord and Wingersky (1985). 
Suggestion #2 

Selecting the initial item pool on the basis of estimated 
information only exaggerates the correlation of estimation errors. 
A better method would stratify on estimated item difficulty alone 
and select randomly within each difficulty stratum. This reduces 
the correlation between estimation errors for item discrimination 
and difficulty. Alternatively, items could be eliminated using 
only conventional criteria such as item content, or poorly behaving 
distractors , and all other items accepted. This method, too, 
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would reduce the correlation. Both of these methods might degrade 
the real quality of the item pool over the entire range of 
ability, requiring a longer adaptive test to achieve the same 
accuracy of estimated ability. An additional alternative 
available in a LOGIST-based approach would be to choose only those 
items that are well estimated, thus eliminating extreme items and 
saving middle difficulty ones. While methods exist for computing 
the appropriate standard errors (see Lord and Wingersky, 1985) 
these methods are difficult and cumbersome to implement. 
Suggestion #3 

The selection of items for administration in an adaptive test 
is typically based on estimated information or estimated 
discrimination. However, we know that for given levels of 
estimated difficulty, drawing items on estimated discrimination 
implies that on average estimated difficulty is not equal to the 
true difficulty because of the correlation of estimation errors. 
A better design would select items for administration on the basis 
of estimated difficulty alone. Even if the errors are correlated, 
this increases the chance that, for given levels of estimated 
difficulty, on average the estimated parameters are equal to their 
true values. This design, although mitigating the correlations 
among estimation errors, degrades the adaptive test over the 
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entire range of interest, requiring the administration of more 
items to achieve similar levels of accuracy. Alternatively, it 
may be possible to develop a confidence interval approach to 
estimated information functions and select items with narrow 
confidence bands around the estimated item information function. 
This latter method has the advantage of possibly maintaining the 
quality of the item pool in the middle of the ability range, while 
at the same time eliminating more extreme and more poorly 
estimated items. This is similar to Suggestion #2, choosing items 
with well -estimated parameters; here we consider items with well- 
estimated functions of parameters. 
Suggestion #4 

Adaptive test scoring algorithms should not weight items with 
high estimated discrimination more than items with low estimated 
discrimination. Both maximum likelihood and Bayesian procedures 
have this property. Use of these estimators will undo the efforts 
to use true item parameters on average. Instead, new scoring 
methods, such as those suggested by Lewis (1985) or Jones, Wainer, 
and Kaplan (1984), that take into account errors in item parameter 
estimates when estimating ability, should be considered. 
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Suggestion #5 

The selection of items to be seeded for on-line calibration 
should be closely tied to algorithms for eliminating items in the 
current pool. Reasonable screening procedures based on classical 
item statistics, if applied to items before they are considered 
for seeding, should improve the maintenance of the quality of the 
item pool by providing more useful new items from which to select 
replacements. Replacement items should be selected from the 
candidate new items using methods that reduce the correlation 
amoung errors of estimation, as in Suggestion #2. 

Consideration shot.. Id be given to eliminating items with too 
few responses in addition to too many responses. This provides 
more room in the item pool for middle difficulty items, at the 
expense of poorer measurement properties at more extreme levels of 
ability. While this may appear to be antithetical to the purposes 
of adaptive testing, it may be cost effective to concentrate 
efforts on maintaining the information structure of the item pool 
in regions where most of the examinees are to be found. 

While the above suggestions for improving cn-line calibration 
have been couched in terms of adaptive testing, the principles 
behind them are also relevant to on-line calibration for other 
model -based testing paradigms that emphasize shcrt computerized 
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tests. Any such test design, by virtue of the fact that test 
lengths are short and the seeding of potential additional items is 
restricted, may have to rely on imperfect information about 
ability to maintain the scale across different versions of the 
item pool. Thus, the accuracy of the initial item calibration, 
the method of selecting items for the initial pool, methods of 
selecting items for administration, and test scoring should all be 
designed to mitigate, as much as possible, the effects of the 
correlated errors of estimation. Careful attention should be paid 
to methods that insure that on average the estimated parameters 
are equal to the true parameters. Such methods will improve the 
accuracy of the estimated ability used to maintain scales across 
Rounds of item pool refreshment, and thereby reduce scale drift. 
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Table 1 

LOGIST Parameter Estimates of Vale's Experimental WK Items (n - 258) 
Taken by Armed Forces Inductees (N - 8171) 



Parameter 


N 


Mean 


S 


D. 


Min 


Max 


10 


Percentiles 
25 50 75 


90 


A 

6 


8171 


-0.00 


1 


06 


-7.26 


5.16 


rl. 22 


-0.65 


-0.04 


0.63 


1.34 


A 

a 


258 


1.25 


0 


47 


0.26 


2.50 


0.66 


0.88 


1.23 


1. 54 


1.88 


A 

b 


258 


-1.08 


2 


00 


-6.10 


3.45 


-3.58 


-2.42 


-1.33 


0.52 


1.76 


A 

C 


258 


0.15 


0 


08 


0.01 


0.45 


0.10 


0.11 


0.14 


0. 19 


0.28 
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Table 2 

Variance of Estimated and True Abilities, Both Methods, 
across All Rounds of Adaptive Testing* 



Method A Method B 

differ- a differ- 

N var(fl) var(fl) ence N var(fl) var(fl) ence 



Round 1 14,979 
(Round 0 pool) 


1. 


15 


1. 


05 


.10 


14,979 


1. 


15 


1. 


05 


.10 


Round 2 14,965 
(Round 1 pool) 


1, 


09 


1 


04 


.05 


14,968 


1 


10 


1 


04 


.06 


Round 3 14,979 
(Round 2 pool) 


1 


16 


1 


05 


.11 


14,976 


1 


11 


1 


05 


.06 


Round 4 14,969 
(Round 3 pool) 


1 


13 


1 


02 


.11 


14,979 


1 


07 


1 


03 


.04 


Round 5 14,965 
(Round 4 pool) 


1 


.19 


1 


03 


.16 


14,976 


1 


.04 


1 


.03 


.01 



*Simulees with all-wrong or all-right response patterns are excluded from 



the computations of both var(ff) and var(0) . 
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Table 3 

Average (across Items) Weighted RMSE for the 
New Items Calibrated in Each Round, for Each Method 



(Method B 

Method A Untransf ormed Method B Number 
New items in Average RMSE Average RMSE*) Average RMSE of items 



Round 0 .023 (.023) .023 100 

Round 1 .025 (.025) .026 50 

Round 2 .027 (.026) .026 50 

Round 3 .029 (.024) .027 50 

Round 4 .031 (.024) .025 50 



*Computed prior to the application of the approximate scaling 
transformation for Method B; see Analysis 4 for explanation. 
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Table 4 

Parameters Developed for the Approximate Scaling Transformation, 
Based on the Anchor Items, for Method B 



Multiplicative Additive 



Round Constant Constant 



1 .921 -.021 

2 .897 -.034 

3 .897 -.016 

4 .917 -.027 
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Item responses on 
WK experimental Items 



Vale 



true IRF's, n - 258 
true ability densities 



Levlne 



LOG I ST parameter 
estlaates, r. - 258 
N - 8171 



Block 1 



Stocking 



Select 100 Items for initial pool. 
Simulate administration of 
conventional tests. 



Davis 



Round 0: 



Simulate initial 
calibration of 
n - 100 Item pool 



Block 2 



Bock 



I Levlne 



Samej lma | 



Stocking 



T 



Simulate adaptive test. Seed items. 
Identify items to be removed. 



Davis 



Round 1 of adaptive testing 

- use Round 0 pool 

- prepare Round 1 pool 
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I Bock 



I Levlne 
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Figure 1. Global design of 'hlgh-f ldellty' simulation study for the ONR On-line 
Calibration Project. 
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Figure 2. 'True' nonparametric item response functions for some 
typical items. 
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Figure 3. Estimated and observed ilem response functions from 
simulated adaptive test responses. 
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Figure 8. Typical Item response functions estimated by LOGIST (dotted 
lines) and the corresponding true Item response functions 
(solid lines). Both estimated and true IRF's are on the scale 
of the Round 0 pool. 
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Figure 12. For all Rounds of Method A, the residuals for the slopes and 
the difficulties for items in the starting poo] and the 
approximate bias in the estimated ability. See text. 
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FiRure 13 For all Rounds of Method B, the residuals for the slopes and 
the difficulties for items in the starting pool and the 
approximate bias in the estimated ability. See text. 
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Figure 14. For all Rounds of Method A, the approximate error 1 unction for 
the estimated abilities and the residuals for the slopes and 
difficulties for the new items. See text. 
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For all Rounds of Mi- 1 hod B, 'tin- approximate error function for 
the estimated abilities and the residuals for the slopes and 
difficulties for the new items. See tex'. 




Figure 16. Method A, item 244, true IRF and estimated IRF's from 
each Round. 
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Figure 17. Method B , item 244, true IRF and estimated IRF's from 
each Round. 
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Figure 18a. The Round 1 residuals for the slopes and difficulties for new items 
for Method A (top), Method B before scaling (middle), and Method B 
after scaling (bottom). See text. 
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Figure 18b. The Round 2 residuals for the slopes and difficulties for new items 
for Method A (top), Method B before scaling (middle), and Method B 
after scaling (bottom) . See text. 
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Figure 18c. The Round 3 residuals for the slopes and difficulties for new items 
for Method A (top), Method B before scaling (middle), and Method B 
after scaling (bottom). See text. 
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Figure 18d. The Round 4 residuals for the slopes and difficulties for new items 
for Method A (top), Method B before scaling (middle), and Method B 
Q after scaling (bottom). See text. 
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Figure 19. Method A estimated test information functions for the 100- item 
pools at each Round of simulations. 
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Figure 20. Method B estimated test information functions for the 100- item 
pools at each Round of simulations. 
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Figure 21. Method A estimated test information functions for the target 
the 50 new items and the 25 replacement items, Round 4. 
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Figure 22. Method B estimated test information functions for the target 
the 50 new items and the 25 replacement items, Round k. 
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Appendix 1. Parameter estimates for items not included in Figures becnu.' 

'true' parameters could not be developed from the tabled 
values of the true item response functions 
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